Web Spam Detection: New Approach with Hidden Markov Models
Authors
Abstract
Web spam results from a range of techniques that deceive search engine algorithms in order to obtain higher ranks in search results. Advanced spammers use keyword and link stuffing to create farms of spam pages. Most recent work in the web spam detection literature uses graph-based methods to improve detection accuracy. This paper proposes a probabilistic approach that uses content- and link-based features to detect web spam pages. Because we observe high connectivity among web spam pages, we adopt a method based on the Hidden Markov Model to exploit the conditional dependency within a sequence of hosts and the spam/normal class distribution of each host. Experimental results show that the proposed method can significantly improve the performance of the baseline classifier.
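The idea of labeling a sequence of hosts jointly, rather than one host at a time, can be illustrated with a small Viterbi decoder. This is only a sketch under stated assumptions, not the paper's model: the two hidden states, the `START` and `TRANS` probabilities, and the use of a baseline classifier's spam probability as a stand-in emission score are all hypothetical choices made for illustration.

```python
import math

# Hypothetical two-state model: each host is either normal or spam.
STATES = ("normal", "spam")

# Assumed parameters (not from the paper): uniform prior and "sticky"
# transitions, reflecting the observation that spam hosts tend to be
# connected to other spam hosts.
START = {"normal": 0.5, "spam": 0.5}
TRANS = {
    ("normal", "normal"): 0.9, ("normal", "spam"): 0.1,
    ("spam", "normal"): 0.1, ("spam", "spam"): 0.9,
}


def emission_log_score(state, spam_prob):
    """Log score of a host given a state, using the baseline P(spam)."""
    p = min(max(spam_prob, 1e-9), 1.0 - 1e-9)  # avoid log(0)
    return math.log(p if state == "spam" else 1.0 - p)


def viterbi(spam_probs, start=START, trans=TRANS):
    """Return the most likely label sequence for a sequence of hosts.

    spam_probs holds one baseline-classifier spam probability per host,
    in the order the hosts appear in the sequence.
    """
    if not spam_probs:
        return []
    scores = {
        s: math.log(start[s]) + emission_log_score(s, spam_probs[0])
        for s in STATES
    }
    back_pointers = []
    for p in spam_probs[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            prev = max(STATES, key=lambda r: scores[r] + math.log(trans[(r, s)]))
            new_scores[s] = (
                scores[prev] + math.log(trans[(prev, s)]) + emission_log_score(s, p)
            )
            pointers[s] = prev
        scores = new_scores
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed parameters, `viterbi([0.9, 0.45, 0.9])` labels all three hosts as spam, whereas thresholding each host independently at 0.5 would label the middle host as normal. That is the kind of correction a sequence model can make to a baseline classifier when neighboring hosts are strongly spam.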
Similar resources
A New Hybrid Approach of K-Nearest Neighbors Algorithm with Particle Swarm Optimization for E-Mail Spam Detection
Email is one of the fastest and most economical means of communication. The growing number of email users has increased the volume of spam in recent years. Spam not only costs users money, time, and bandwidth, but has also become a risk to the efficiency, reliability, and security of networks. Spam developers are always trying to find ways to evade existing filters; therefore new filters to de...
An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models
With the massive use of the Internet and search engines, a major problem that has come to light is web spam. Web spam can be detected by analyzing various features of web pages and categorizing the pages as spam or non-spam. The proposed work considers unsupervised learning algorithms to characterize web pages based on link-based features and content based f...
Segmental parameterisation and statistical modelling of e-mail headers for spam detection
Spammers exploit the popularity and low cost of e-mail services to send unsolicited messages (spam), which fill users’ accounts and waste valuable resources. To combat this problem, many different spam filtering techniques have been proposed in the literature. Nevertheless, most current anti-spamming filtering schemes are based on detecting relevant terms or tokens in the entire message or in ...
A Novel Hybrid Approach for Email Spam Detection based on Scatter Search Algorithm and K-Nearest Neighbors
Because cyberspace and Internet predominate in the life of users, in addition to business opportunities and time reductions, threats like information theft, penetration into systems, etc. are included in the field of hardware and software. Security is the top priority to prevent a cyber-attack that users should initially be detecting the type of attacks because virtual environments are not moni...
A structural, content-similarity measure for detecting spam documents on the web
Purpose: The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. To improve the qua...